Nature Medicine — Latest Matching Preprints

1

AI-Driven Longitudinal Characterization of Neonatal Health and Morbidity

De Francesco, D.; Reiss, J. D.; Roger, J.; Tang, A. S.; Chang, A. L.; Becker, M.; Phongpreecha, T.; Espinosa, C.; Morin, S.; Berson, E.; Thuraiappah, M.; Le, B. L.; Ravindra, N. G.; Payrovnaziri, S. N.; Mataraso, S.; Kim, Y.; Xue, L.; Rosenstein, M.; Oskotsky, T.; Maric, I.; Gaudilliere, B.; Carvalho, B.; Bateman, B. T.; Angst, M. S.; Prince, L. S.; Blumenfeld, Y. J.; Benitz, W. E.; Fuerch, J. H.; Shaw, G. M.; Sylvester, K. G.; Stevenson, D. K.; Sirota, M.; Aghaeepour, N.

2022-04-05 pediatrics 10.1101/2022.03.31.22273233 medRxiv

Top 0.1%

52.4%

While prematurity is the single largest cause of death in children under 5 years of age, the current definition of prematurity, based on gestational age, lacks the precision needed for guiding care decisions. Here we propose a longitudinal risk assessment for adverse neonatal outcomes in newborns based on a multi-task deep learning model that uses electronic health records (EHRs) to predict a wide range of outcomes over a period starting shortly after the time of conception and ending months after birth. By linking the EHRs of the Lucile Packard Children's Hospital and the Stanford Healthcare Adult Hospital, we developed a cohort of 22,104 mother-newborn dyads delivered between 2014 and 2018. This enabled a unique linkage between long-term maternal information and newborn outcomes. Maternal and newborn EHRs were extracted and used to train a multi-input multi-task deep learning model, featuring a long short-term memory neural network, to predict 24 different neonatal outcomes. An additional set of 10,250 mother-newborn dyads delivered at the same Stanford Hospitals from 2019 to September 2020 was used to independently validate the model, followed by a separate analysis of 12,256 mother-newborn dyads at the University of California, San Francisco. Moreover, comprehensive association analysis identified multiple known and new associations between various maternal and neonatal features and specific neonatal outcomes. To date, this is the largest study utilizing linked EHRs from mother-newborn dyads and would serve as an important resource for the investigation and prediction of neonatal outcomes. An interactive website is available for independent investigators to leverage this unique dataset: https://maternal-child-health-associations.shinyapps.io/shiny_app/.

2

Human vs AI Clinical Assessment: Benchmarking a Multimodal Foundation Model Against Multi-Center Expert Judgment on the Mental Status Examination.

Mwangi, B.; Jabbar Abdl Sattar Hamoudi, H.; Sanches, M.; Dogan, N.; Chaudhary, P.; Wu, M.-J.; Zunta-Soares, G. B.; Soares, J. C.; Martin, A.; Soutullo, C. A.

2026-04-20 psychiatry and clinical psychology 10.64898/2026.04.17.26351105 medRxiv

Top 0.1%

52.1%

The Mental Status Examination (MSE) is the cornerstone of the psychiatric evaluation, yet validating artificial intelligence (AI) against the inherent variance of clinical judgment remains a critical bottleneck. Here we introduce a multi-center framework to benchmark the open-weight multimodal foundation model Qwen3-Omni against independent expert panels at two sites, UTHealth and Yale. Evaluating 396 classifications across 10 MSE domains and three longitudinal timepoints of increasing symptom severity, we found that experts achieved substantial agreement (Gwet's AC1 = 0.87), whereas the model achieved only moderate alignment (AC1 = 0.70-0.72). Even as the model's overall pathology prediction rate approximated the experts', the aggregate equilibrium masked a profound "clinical reasoning gap". Specifically, the model systematically over-predicted observable signs (e.g., speech, affect) while notably failing in inferential domains requiring the interpretation of latent mental content (e.g., delusions, perceptions). A 4-bit quantization analysis of the model confirmed this mechanistically: reducing model capacity disproportionately degraded inferential reasoning while preserving perceptual feature extraction. Furthermore, model-to-expert agreement degraded linearly as clinical complexity intensified across longitudinal visits (Accuracy: T0 = 84.8-87%; T1 = 80-82%; T2 = 71-73%), whereas expert consensus remained robust. Notably, model errors increased 2.3- to 3.4-fold where human experts disagreed. These findings establish inter-expert variance as an essential measurable baseline for psychiatric AI, demonstrating that true clinical translation requires models to move beyond multimodal perceptual extraction to achieve higher-order diagnostic reasoning.
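
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Gwet's AC1 for two raters and nominal labels. It is not the authors' code: the function name `gwet_ac1`, the `categories` parameter, and the toy ratings are assumptions for illustration, and the study's multi-rater, multi-site setup is likely more involved.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters who labelled the same items.

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and
    p_e = (1 / (K - 1)) * sum_k pi_k * (1 - pi_k), with pi_k the average
    share of ratings that fall in category k across both raters.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Hypothetical convenience: infer the label set if it is not supplied.
    if categories is None:
        categories = sorted(set(rater_a) | set(rater_b))
    k = len(categories)
    if k < 2:
        raise ValueError("AC1 needs at least two categories")

    # Observed agreement: fraction of items given the same label.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement under Gwet's model.
    counts = Counter(rater_a) + Counter(rater_b)
    p_e = sum(
        (counts[c] / (2 * n)) * (1 - counts[c] / (2 * n)) for c in categories
    ) / (k - 1)

    return (p_a - p_e) / (1 - p_e)


# Toy example: 1 = pathology present, 0 = absent (made-up ratings).
expert = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
model = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(gwet_ac1(expert, model, categories=[0, 1]))
```

In the toy example the raters agree on 8 of 10 items and each category takes half of all ratings, so the chance term is 0.5 and AC1 is (0.8 - 0.5) / (1 - 0.5) = 0.6.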

3

Reasoning Over Pre-training: Evaluating LLM Performance and Augmentation in Women's Health

Imprialou, M.; Kaltsas, N.; Oliinyk, V.; Vigrass, T.; Schwarzmann, J.; Rosenthal, R.; Glastonbury, C.; Wigley, C.; Gillam, M.; Kanani, N.; Supramaniam, P.; Granne, I.; Lindgren, C. M.

2025-05-23 obstetrics and gynecology 10.1101/2025.05.22.25328162 medRxiv

Top 0.1%

48.7%

Recent advances in large language models (LLMs) show promise in clinical applications, but their performance in women's health remains underexamined [1]. We evaluated LLMs on 2,337 questions from obstetrics and gynaecology, including 1,392 from the Royal College of Obstetricians and Gynaecologists Part 2 examination (MRCOG Part 2) [2], a UK-based test of advanced clinical decision-making, and 945 from MedQA [3], a dataset derived from the United States Medical Licensing Examination (USMLE). The best-performing model, OpenAI's o1-preview [4] enhanced with retrieval-augmented generation (RAG) [5,6], achieved 72.00% accuracy on MRCOG Part 2 and 92.30% on MedQA, exceeding prior benchmarks by 21.6% [1]. General-purpose reasoning models outperformed domain-specific fine-tuned models such as MED-LM [7]. We also analyse performance by clinical subdomain and find lower accuracy in areas like fetal medicine and postpartum care. These findings highlight the importance of reasoning capabilities over domain-specific fine-tuning and demonstrate the value of augmentation methods like RAG for improving accuracy and interpretability [8].

4

Solving Emergency Department Triage with Small Language Models

Belski, V.; Lukina, K.

2026-05-05 health policy 10.64898/2026.05.04.26352355 medRxiv

Top 0.1%

43.8%

Emergency department (ED) triage assigns patients a five-level Emergency Severity Index (ESI) score that determines care priority. We investigate the feasibility of automating this process, comparing large commercial models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, MedGemma) against a purpose-built pipeline combining a small extraction model with a deterministic clinical engine, and a 9B-parameter language model trained with structured chain-of-thought supervision and reinforcement learning. Off-the-shelf large models achieve only 45-55% exact ESI accuracy while being impractical for clinical deployment due to privacy constraints, cost, and latency. Our specialized BiomedBERT [4] pipeline achieves 88.9% exact accuracy with 97.2% adjacent accuracy (±1 ESI) on a 50-case expert-labeled evaluation set, approaching nurse inter-rater agreement. A Qwen3.5-9B model [16] fine-tuned with chain-of-thought supervision achieves 75.0% exact / 97.2% adjacent accuracy on a 36-case narrative evaluation. Ongoing GRPO training [13] with a clinically asymmetric reward function and 2,776 ESI-1 narrative training cases (previously 22, due to a discovered extraction bug) shows strong early reward signal. We document 37+ BERT experiments, multiple LLM training cycles, systematic data quality audits, and the specific engineering decisions that enabled progress, including the discovery that 71% of training labels for altered mental status were false positives.
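
The exact and adjacent (±1 ESI) accuracy figures quoted above can be computed along the following lines. This is an illustrative sketch only: the function name `triage_accuracy`, the `tolerance` parameter, and the sample labels are assumptions and do not come from the preprint.

```python
def triage_accuracy(true_esi, predicted_esi, tolerance=1):
    """Return (exact, adjacent) accuracy for ESI levels 1-5.

    exact    : share of cases where the predicted level equals the label
    adjacent : share of cases where the prediction is within `tolerance`
               levels of the label (±1 ESI by default)
    """
    if len(true_esi) != len(predicted_esi) or not true_esi:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(true_esi, predicted_esi))
    exact = sum(t == p for t, p in pairs) / len(pairs)
    adjacent = sum(abs(t - p) <= tolerance for t, p in pairs) / len(pairs)
    return exact, adjacent


# Made-up labels for six cases.
labels = [1, 2, 3, 3, 4, 5]
predictions = [1, 3, 3, 2, 4, 3]
exact, adjacent = triage_accuracy(labels, predictions)
print(f"exact={exact:.3f} adjacent={adjacent:.3f}")
```

As a point of arithmetic, on a 36-case set 27 exact matches and 35 predictions within one level would give 75.0% and 97.2%.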

5

Decoding the hallmarks of GLP-1RA weight-loss super responders

Venkatakrishnan, A.; Murugadoss, K.; Soundararajan, V.

2025-11-17 endocrinology 10.1101/2025.11.15.25340314 medRxiv

Top 0.1%

40.2%

Glucagon-like peptide-1 receptor agonists (GLP-1RAs) have reshaped obesity treatment, yet weight-loss outcomes remain highly uneven in real-world care. Using a federated biomedical platform integrating 23 million de-identified U.S. patient records, we analyzed 135,349 individuals treated with GLP-1RAs and stratified them as "super responders" (>15% weight loss), "moderate responders" (5-15% weight loss), "minimal weight-loss group" (<5% weight loss), and "weight regainers". Super responders reversed nearly two decades of age-associated weight gain in one year, representing approximately a decade more weight reversal than moderate responders. Compared with Wegovy (semaglutide), Zepbound (tirzepatide) showed 47% higher odds (CI: 33-61%) of super-response and 30% lower odds (CI: 23-37%) of minimal weight-loss. Likewise, relative to Ozempic (semaglutide), Mounjaro (tirzepatide) showed 284% (CI: 265-304%) higher odds of super-response and 48% (CI: 46-51%) lower odds of minimal weight-loss. AI-enabled curation processed more than 14 million clinical notes and 15 million structured records covering 1,426 disease terms across the year before and after GLP-1RA initiation. Wegovy and Ozempic super responders showed marked post-treatment increases in vomiting compared with pre-treatment baselines, as reflected by pre-to-post rate ratios (RR 0.37, p=0.014 and RR 0.09, p<0.001). In contrast, Zepbound super responders showed significantly lower post-treatment vomiting relative to baseline (RR 2.34, p<0.001), indicating brand-specific gastrointestinal tolerability profiles. Ozempic (RR 0.24, p<0.001) and Mounjaro (RR 0.17, p<0.001) super responders each showed significant post-treatment increases in diagnoses of protein-energy malnutrition, suggesting a need for whole-body compositional imaging to distinguish beneficial fat loss from unintended lean-mass loss. Novel signals for therapeutic expansion also emerged.
Compared with pre-treatment baselines, Zepbound showed significantly reduced post-treatment encounters for recurrent major depressive disorder (pre-to-post RR 12.6, p<0.001) and asthma (pre-to-post RR 2.6, p<0.001). Patient stratification prior to therapy initiation revealed pre-treatment signatures that can guide GLP-1RA choice, with Zepbound super responders showing lower sleep apnea prevalence (baseline RR 0.42, p<0.001) and higher muscle stiffness prevalence (baseline RR 2.4, p=0.037). This study pinpoints actionable physiological signatures and GLP-1RA brand-specific opportunities that emerge from heterogeneous real-world responses, outlining a map for guided precision obesity interventions.

6

Learning the natural history of human disease with generative transformers

Shmatko, A.; Jung, A. W.; Gaurav, K.; Brunak, S.; Mortensen, L.; Birney, E.; Fitzgerald, T.; Gerstung, M.

2024-06-07 epidemiology 10.1101/2024.06.07.24308553 medRxiv

Top 0.1%

33.7%

Decision-making in healthcare relies on the ability to understand patients' past and current health state to predict, and ultimately change, their future course. Artificial intelligence (AI) methods promise to aid this task by learning patterns of disease progression from large corpora of health records to predict detailed outcomes for an individual. However, the potential of AI has not yet been fully investigated at scale. Here, we modify the GPT (generative pretrained transformer) architecture to model the temporal progression and competing nature of human diseases in a population-scale cohort. We train this model, termed Delphi-2M, on data from 0.4 million participants of the UK Biobank and validate it using external data from 1.9 million Danish individuals with no change in parameters. Delphi-2M predicts the rates of more than 1,000 different ICD-10 coded diseases and death, conditional on each individual's past disease history, age, sex and baseline lifestyle information, and with accuracy comparable to existing single-disease models. Delphi-2M's generative nature also enables sampling future health trajectories at any point within an individual's life course with outcomes across the entire disease spectrum. Sampled health trajectories provide meaningful estimates of future disease burden for up to 20 years and enable training AI models which have never seen actual data. Explainable AI methods provide insights into Delphi-2M's predictions, revealing temporal clusters of co-morbidities within and across different disease chapters and their time-dependent consequences on the future health course. These analyses, however, also reveal that biases underlying the available training data, which in the case of the UK Biobank stem from distinct healthcare sources, are learned and highlighted.
In summary, GPT-based models appear well suited for predictive and generative health-related tasks, are applicable to population scale health data sets and provide insights into the temporal dependencies of past events that shape future health, impacting our ability to obtain an instantaneous view of personalised health state.

7

A novel variant of interest of SARS-CoV-2 with multiple spike mutations is identified from travel surveillance in Africa

de Oliveira, T.; Lutucuta, S.; Nkengasong, J.; Morais, J.; Paula Paixao, J.; Neto, Z.; Afonso, P.; Miranda, J.; David, K.; Ingles, L.; Carralero, A. P. A. P. R. R.; Freitas, H. R.; Mufinda, F.; Tessema, S. K.; Tegally, H.; San, E. J.; Wilkinson, E.; Giandhari, J.; Pillay, S.; Giovanetti, M.; Naidoo, Y.; Singh, L.; Tshiabuila, D.; Martin, D.; Lessells, R. J.

2021-04-04 infectious diseases 10.1101/2021.03.30.21254323 medRxiv

Top 0.1%

33.2%

At the end of 2020, the Network for Genomic Surveillance in South Africa (NGS-SA) detected a SARS-CoV-2 variant of concern (VOC) in South Africa (501Y.V2 or PANGO lineage B.1.351) [1]. 501Y.V2 is associated with increased transmissibility and resistance to neutralizing antibodies elicited by natural infection and vaccination [2,3]. 501Y.V2 has since spread to over 50 countries around the world and has contributed to a significant resurgence of the epidemic in southern Africa. In order to rapidly characterize the spread of this and other emerging VOCs and variants of interest (VOIs), NGS-SA partnered with the Africa Centres for Disease Control and Prevention and the African Society of Laboratory Medicine through the Africa Pathogen Genomics Initiative to strengthen SARS-CoV-2 genomic surveillance across the region.

8

FeverIQ - A Privacy-Preserving COVID-19 Symptom Tracker with 3.6 Million Reports

Ranjan, A.; Li, S.; Chen, B.; Chiu, A.; Jagadeesh, K.; Liphardt, J.

2020-09-25 public and global health 10.1101/2020.09.23.20200006 medRxiv

Top 0.1%

33.0%

Population-scale COVID-19 management benefits from timely and honest information from billions of people. Here, we provide a first report on the FeverIQ symptom tracker, a global effort to collect symptom and test data which has received more than 3.6 million submissions. Unlike other trackers, FeverIQ uses secure multiparty computation (SMC) to cryptographically guarantee user privacy while providing insights to scientists and public health efforts. We performed basic integrity checks of the FeverIQ dataset, such as by comparing it to other publicly released data. We then trained a linear classifier on diagnosis scores which were computed securely, without unprotected symptom data ever leaving a user's phone or computer. FeverIQ is currently the world's largest application of SMC in a health context, demonstrating the practicality of privacy-preserving analytics for population-scale digital health interventions.

9

MICNet: Prediction of antibiotic susceptibility from microscopic images using transfer learning

Viehweger, A.; Hölzer, M.; Brandt, C.

2022-04-21 infectious diseases 10.1101/2022.04.19.22269518 medRxiv

Top 0.1%

33.0%

Rapid susceptibility testing of bacterial isolates is crucial for anti-infective therapy, especially in critical cases such as bacteraemia and sepsis. Nevertheless, empiric therapy is often initiated immediately and without testing because two days or more pass between a positive blood culture and a susceptibility profile, so in the meantime, the most likely pathogens are treated. However, current empiric recommendations are very generic. They often remain unmodified even in light of incoming, early data specific to a patient's case, such as positive blood culture microscopy. Part of the hesitancy to change treatments presumably stems from a lack of systematic integration of early information beyond expert intuition. To enable targeted antimicrobial therapy earlier in a case's progression, we developed a method to predict antimicrobial susceptibility from microscopy images of bacteria alone. Our proof-of-concept MICNet combines two neural nets in a new chimerical architecture. It is pre-trained on about 100 thousand antibiograms and fine-tuned with only five thousand microscopic images through transfer learning. Predicting susceptibility profiles of four representative species, we show high predictive performance with a mean F-score of nearly 85%. In addition, several qualitative assessments show that our chimerical net has learned substantial expert knowledge. Therefore, MICNet is the first step towards personalized empiric therapy, combining prior pathogen probabilities with patient-specific data.

10

Covid-19 Will Reduce US Life Expectancy at Birth by More Than One Year in 2020

Heuveline, P.

2020-12-04 public and global health 10.1101/2020.12.03.20243717 medRxiv

Top 0.1%

32.3%

On December 3rd, 2020, the cumulative number of U.S. Covid-19 deaths tallied by the Johns Hopkins University (JHU) online dashboard reached 275,000, surpassing the number at which life table calculations show Covid-19 mortality will lower the U.S. life expectancy at birth (LEB) for 2020 by one full year. Such an impact on the U.S. LEB is unprecedented since the end of World War II. With additional deaths by the year end, the reduction in 2020 LEB induced by Covid-19 deaths will inexorably exceed one year. Factoring the expected continuation of secular gains against other causes of mortality, the U.S. LEB should still drop by more than a full year between 2019 and 2020. By comparison, the opioid-overdose crisis led to a decline in U.S. LEB averaging 0.1 year annually, from 78.9 years in 2014 to 78.6 years in 2017. At its peak, the HIV epidemic reduced the U.S. LEB by 0.3 year in a single year, from 75.8 years in 1992 to 75.5 years in 1993. As of now, the U.S. LEB is expected to fall back to the level it first reached in 2010. In other words, the impact of Covid-19 on U.S. mortality can be expected to cancel a decade of gains against all other causes of mortality combined.

11

A Multimodal Framework for Organ- and Cell-Resolved Biological Aging and Longevity Intervention Discovery

Al Dajani, S. A.; Williams, J. R.; Fuentealba, M.; Zhai, T.; Furman, D.; Snyder, M.; Abudayyeh, O. O.; Gootenberg, J. S.; Gladyshev, V. N.

2026-05-12 geriatric medicine 10.64898/2026.05.08.26352759 medRxiv

Top 0.1%

32.1%

Aging is the primary driver of chronic disease and mortality, requiring comprehensive frameworks for quantification of aging and nomination of longevity interventions. We developed mAge (multimodal age), a biological aging framework that integrates plasma proteomics, wearables, and mortality hazard to predict biological age, intrinsic capacity, and mortality risk. By combining proteomic and wearable data in UK Biobank samples, mAge improves on unimodal baseline age prediction, reaching 0.87 test R² and 2.3 years mean error, and reduces unimodal baseline mortality prediction error by 21%. We further constructed organ- and cell type-specific biological clocks that quantify aging across 49 distinct subsystems, revealing that cardiac, immune, and intracellular protein signatures benefit most from wearable integration. By mapping data to FDA-approved drug targets, we identified interventions, such as GLP-1 receptor agonists, gabapentin, and ACE inhibitors, that are associated with lower overall and subsystem-specific proteomic age and mortality risk or are associated with longer time-to-death and later age-at-death in longitudinal and deceased cohorts. mAge establishes a scalable framework for nominating and validating personalized longevity interventions, bridging continuous digital monitoring with molecular aging diagnostics.

12

Early prediction and fairness evaluation of perinatal depression using EHR: A study of 18,000+ Pregnancies

Sarwal, V.; Pimplaskar, A.; Richards, M.; Sobowale, K.; Chiang, J. N.; Loohuis, L. O.

2025-07-03 psychiatry and clinical psychology 10.1101/2025.06.19.25329946 medRxiv

Top 0.1%

32.0%

Perinatal depression (PND), defined as a depressive illness occurring during pregnancy or following childbirth, affects 10-20% of mothers. It is one of the greatest causes of mortality and morbidity in mothers and is associated with poor outcomes in children. Early identification of at-risk mothers has the potential to greatly reduce its impact. While specific risk factors for PND have been identified, most notably a history of prior depression, it is unclear whether mothers' Electronic Health Records (EHR) can be used early in pregnancy to predict who will go on to develop PND, especially in mothers without a history of prior depression. In this paper, we used clinical EHR data from the UCLA health system to develop predictive models of perinatal depression at a patient's first prenatal visit (n = 18,081 pregnant mothers, n = 4,307 with PND). We used a variety of predictive models, including Ridge Regression, Gradient Boosting Trees, Random Forests, and ExtraTrees. We performed separate analyses including only mothers without a history of prior depression. We further evaluated the robustness and fairness of our algorithms by comparing models stratified by self-reported ethnoracial group and social determinants of health (e.g., social vulnerability index (SVI)). All model architectures used perform similarly. The Random Forest model provided robust performance with the highest accuracy and well-balanced sensitivity and specificity (AUROC 0.75, CI [0.66,0.84] in the full cohort). However, performance was reduced among mothers without prior depression (AUROC 0.71, CI [0.6,0.8]).
Important risk factors identified by our model include known risk factors, such as prior mental health histories (prior depression, anxiety disorders), socioeconomic factors (social vulnerability), patient vitals (blood pressure), and measures of inflammation in blood (white blood cell counts, platelet counts), as well as novel ones (patient pulse, mean platelet volume (MPV), red blood cell distribution width (RDWSD) and rapid plasma reagin (RPR)). We observed similar model performance when stratifying our cohort by social determinants of health, with overlapping ROC bounds, equalized odds ratios between groups close to 0.8, and largely overlapping predictors of importance across models. This was not the case for ethnoracial groups, where top predictive features varied by ethnoracial category.

13

Integrating Infection Burden and Multimodal Biomarkers for Early Detection of Alzheimer's Disease: A Sheaf-ML Framework.

Thakur, L. S.; Bharj, G.; Nguyen, D.-T.; Saroya, M.; Malik, B.

2025-11-13 neurology 10.1101/2025.11.10.25339915 medRxiv

Top 0.1%

28.7%

Alzheimer's disease (AD) remains a major global health challenge, with growing evidence linking chronic infections, immune aging, and neurodegeneration. Grounded in the Antimicrobial Protection Hypothesis, this study introduces a sheaf-theoretic machine learning framework, Sheaf-ML, for integrating multimodal health data and assessing infection-related cognitive risk. Sheaf-ML constructs a unified patient-level representation that coherently combines diverse data streams, including serological infection markers, cognitive assessments, cardiovascular and metabolic measures, and nutritional and behavioral evaluations, while preserving the intrinsic structure and relationships of each modality. Applying this framework to the Harmonized LASI-DAD dataset (N = 6168), we modeled six clinically motivated domains (Infection, Cognition, Mental Health, Cardiovascular, Nutrition, and Demographics) and integrated them into a topologically consistent representation using learnable cross-domain mappings and consistency constraints. The sheaf-integrated embeddings revealed clinically meaningful interactions: infection burden was linked with cardiovascular, nutritional, and cognitive outcomes, highlighting system-level coordination across modalities. Using these embeddings, Sheaf-ML produced interpretable patient-level predictions and identified the most influential features both globally and individually. We further derived an Infection Burden Index (IBI), which quantified patient-level infection-related risk. Patients exceeding the 80th percentile were flagged as early-warning cases, corresponding to approximately 20% of the cohort, demonstrating actionable stratification for clinical monitoring. This study provides the first empirical evidence that sheaf-based architectures can integrate multimodal health data in a clinically interpretable manner, uncover biologically meaningful interactions, and support patient-specific risk prediction.
By linking population-level patterns with individualized insights, Sheaf-ML establishes a foundation for scalable, interpretable, and equitable precision models of infection-related cognitive decline in Alzheimer's disease.
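
The percentile-based flagging rule described in this abstract can be illustrated with a short sketch. It assumes each patient already has a scalar index score; how the Infection Burden Index itself is derived is not specified here. The name `flag_high_burden` and the sample scores are hypothetical.

```python
from statistics import quantiles


def flag_high_burden(scores, percentile=80):
    """Return (threshold, flagged_indices) for scores above a percentile.

    `percentile` must be an integer from 1 to 99. The cut point uses the
    "inclusive" method of statistics.quantiles, which treats the data as
    the full population of interest and interpolates between neighbours.
    """
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    # quantiles(..., n=100) returns the 99 cut points P1..P99.
    threshold = quantiles(scores, n=100, method="inclusive")[percentile - 1]
    flagged = [i for i, s in enumerate(scores) if s > threshold]
    return threshold, flagged


# Made-up index scores for ten patients.
ibi = [3, 7, 1, 9, 4, 10, 2, 6, 8, 5]
threshold, flagged = flag_high_burden(ibi)
print(threshold, flagged)
```

With a strict "greater than" rule and no ties, about one fifth of patients exceed the 80th percentile by construction, which matches the "approximately 20% of the cohort" figure in the abstract.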

14

Integrative, and Scalable mental health phenotyping using a knowledge-graph-derived dual-metric framework

Sharma, A.; Bharadwaj, A.; Modi, S.; Ahuja, G.; Jain, A.; Kumar, K.

2026-03-16 psychiatry and clinical psychology 10.64898/2026.03.09.26347798 medRxiv

Top 0.1%

28.4%

Prevailing diagnostic instruments for anxiety and depression, though clinically indispensable, remain anchored to symptom-focused queries that assess patients directly about their affective states, while often neglecting the multidimensional architecture of daily living. Here, we introduce two complementary metrics, the Cognitive Attention Score (CAS) and C:ERR (Cognition-to-Emotional-Response Ratio), derived from yogic psychology and operationalized within a structured knowledge graph (Ceekr-KG) comprising 151,288 triples linking 354 discrete CAS levels, 26 continuous C:ERR values, and 80 clinical symptoms. Rather than interrogating disease phenotypes directly, these metrics are computed by capturing circadian, nutritional, and lifestyle factors that jointly regulate cognitive and emotional homeostasis. The hyperparameter-tuned Ceekr-KG model demonstrated high structural fidelity (Hits@1 = 97%, mean reciprocal rank = 0.98), substantially outperforming relation-preserving randomized controls, indicating that predictive performance arises from semantic structure rather than graph topology alone. CAS and C:ERR showed a strong positive association (Spearman's ρ = 0.787, p < 0.0001) but exhibited distinct distributional properties, with C:ERR displaying consistently stronger inverse correlations with symptom severity across domains (e.g., low energy: ρ = -0.85 versus -0.70 for CAS). Ordinal regression further showed that a combined CAS and C:ERR model outperformed either metric alone for most symptoms, indicating complementary and non-redundant contributions to clinical variance. Integration of Ceekr-KG into the independent Clinical Knowledge Graph improved predictive performance of widely used questionnaire-based assessment scales, demonstrating that yogic psychological frameworks encode clinically relevant semantic information.
Finally, longitudinal analysis of 249 individuals meeting predefined inclusion criteria (baseline CAS < 64 and >=2 assessments) across three therapeutic programmes revealed a mean CAS increase of +11.45 points (p < 0.001) and substantial migration from lower to higher functional bands, establishing Ceekr-KG as a validated digital phenotype for scalable mental health assessment.
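
The link-prediction metrics quoted in this abstract (Hits@1 and mean reciprocal rank) are both computed from the rank that a model assigns to the correct entity for each test triple. The sketch below shows the standard definitions; the function names and the sample ranks are illustrative assumptions, not values from the study.

```python
def hits_at_k(ranks, k=1):
    """Share of test triples whose correct entity is ranked in the top k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples (ranks start at 1)."""
    if not ranks or any(r < 1 for r in ranks):
        raise ValueError("ranks must be non-empty and start at 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Made-up ranks of the correct entity for five test triples.
ranks = [1, 1, 2, 1, 4]
print(hits_at_k(ranks, k=1))        # 3 of 5 triples ranked first
print(mean_reciprocal_rank(ranks))  # (1 + 1 + 0.5 + 1 + 0.25) / 5
```

Because every rank-1 hit contributes a full 1.0 to the reciprocal-rank sum, MRR is always at least as large as Hits@1, consistent with the reported 0.98 versus 97%.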

15

Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta Analysis

Zhu, L.; Wang, W.; Liang, Z.; Tan, W.; Chen, B.; Lin, X.; Wu, Z.; Yu, H.; Li, X.; Jiao, J.; He, S.; Dai, G.; Niu, J.; Zhong, Y.; Hua, W.; Chan, N. Y.; Lu, L.; Wing, Y. K.; Ma, X.; Fan, L.

2026-04-22 psychiatry and clinical psychology 10.64898/2026.04.21.26351365 medRxiv

Top 0.1%

27.4%

The rapid rise of large language models (LLMs) and foundation models has accelerated efforts to build artificial intelligence (AI) agents for mental health assessment, triage, psychotherapy support and clinical decision assistance. Yet a gap persists between healthcare and AI-focused work: while both communities use the language of "agents," clinical research largely describes monolithic chatbots, whereas AI studies emphasize agentic properties such as autonomous planning, multiagent coordination, tool and database use and integration with multimodal mental health data streams. In this Review, we conduct a systematic analysis of mental health AI agent systems from 2023 to 2025 using a six-dimensional audit framework: (i) system type (base model lineage, interface modality and workflow composition, from rule-based tools to role-aware multi-agent foundation-model systems), (ii) data scope (modalities and provenance, from elicited self-report and chatbot dialogues to electronic health records, biosensing and synthetic corpora), (iii) mental health focus (mapped to ICD-11 diagnostic groupings), (iv) demographics (age strata, geography and sex representation), (v) downstream tasks (screening/triage, clinical decision support, therapeutic interventions, documentation, ethical-legal support and education/simulation) and (vi) evaluation types (automated metrics, language quality benchmarks, safety stress tests, expert review and clinician or patient involvement). 
Across this corpus, we find that most systems (1) concentrate on depression, anxiety and suicidality, with sparse coverage of severe mental illness, neurocognitive disorders, substance use and complex comorbidity; (2) rely heavily on text-based self-report rather than clinically verified longitudinal data or genuinely multimodal inputs; (3) are implemented as single-agent chatbots powered by general-purpose LLMs rather than role-structured, workflow-integrated pipelines; and (4) are evaluated primarily via offline metrics or vignette-based scenarios, with few prospective, clinician- or patient-in-the-loop studies. At the same time, an emerging class of agentic systems assigns foundation models explicit roles as planners, retrieval agents, safety auditors or supervisors coordinating other models and tools. These multiagent, tool-augmented workflows promise personalization, safety monitoring and greater transparency, but they also introduce new risks around reliability, bias amplification, privacy, regulatory accountability and the blurring of clinical versus non-clinical roles. We conclude by outlining priorities for the next generation of mental health AI agents: clinically grounded, role-aware multi-agent architectures; transparent and privacy-preserving use of clinical and elicited data; demographic and cultural broadening beyond predominantly Western adult samples; and evaluation pipelines that progress from offline benchmarks to longitudinal, real-world studies with routine safety auditing and clear governance of responsibilities between agents and human clinicians.

16

Harnessing Mechanistic Simulators for Rapid Diagnostic Test Capture and Deep Learning Classification

Rogers, E.; Turbe, V.; Gareta, D.; Herbst, C.; Herbst, K.; Shahmanesh, M.; McKendry, R. A.

2025-02-25 public and global health 10.1101/2025.02.25.25322677 medRxiv

Top 0.1%

26.4%

Rapid diagnostic tests (RDTs) support affordable disease diagnosis. Machine learning (ML) can improve RDT interpretation but often relies on large, proprietary, and costly real-world image libraries. We present SynSight, an ML-enabled RDT segmentation and classification pipeline trained on synthetic data. Validated on HIV (98% sensitivity, 99% specificity) and COVID-19 RDTs (up to 99% accuracy), SynSight enables rapid ML training without real-world images, keeping pace with new RDT development.

17

Disentangling Symptom Heterogeneity in Large-Scale Psychiatric Text: Domain-Adapted vs. Instruction-Tuned Transformers

Varone, G.; Kumar, P.; Brown, J.; Boulila, W.

2026-02-26 psychiatry and clinical psychology 10.64898/2026.02.24.26347006 medRxiv

Top 0.1%

25.9%

Psychiatric disorders are fundamentally challenged by symptom heterogeneity, high comorbidity, and the absence of objective biomarkers, which together result in substantial variability in clinical assessment and treatment selection. Patient-generated language captures rich information about subjective experience and symptom severity, which can be systematically encoded and analyzed using computational models, making it a scalable signal for psychiatric assessment. We compare two approaches: (i) a domain-specialized transformer fine-tuned on clinical language, based on the Bio-ClinicalBERT encoder architecture, and (ii) a large-scale instruction-tuned generalist encoder (Instructor-XL) used as a frozen feature extractor with a shallow classification head. A corpus of N = 151,228 de-identified texts was compiled from five public sources, covering four psychiatric phenotypes: anxiety, depression, schizophrenia, and suicidal intention. Models were evaluated using stratified 10-fold cross-validation with cost-sensitive training, prioritizing imbalance-aware metrics, including Macro-F1 and Matthews Correlation Coefficient (MCC), over accuracy. Bio-ClinicalBERT achieved superior overall performance (Macro-F1 = 0.78, MCC = 0.6752), indicating more reliable separation of diagnostically overlapping affective categories. In contrast, Instructor-XL achieved its highest class-specific performance for schizophrenia (F1 = 0.798). Explainability analyses suggest that the domain-specialized model places greater weight on clinically relevant terms, whereas the generalist model relies on a broader set of lexical features.
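
The two imbalance-aware metrics named in this abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function name `macro_f1_and_mcc` and the toy labels are assumptions made for illustration, and the MCC shown is the standard multiclass generalization computed from class counts.

```python
from math import sqrt


def macro_f1_and_mcc(y_true, y_pred):
    """Return (macro-F1, multiclass MCC) for two equal-length label lists.

    Hypothetical helper for illustration; not taken from the preprint.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))

    # Macro-F1: unweighted mean of per-class F1, so rare classes count equally.
    f1_scores = []
    for k in labels:
        tp = sum(1 for t, p in pairs if t == k and p == k)
        fp = sum(1 for t, p in pairs if t != k and p == k)
        fn = sum(1 for t, p in pairs if t == k and p != k)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)

    # Multiclass MCC from class totals:
    # (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    s = len(pairs)                                   # total samples
    c = sum(1 for t, p in pairs if t == p)           # correctly classified
    t_k = [sum(1 for t in y_true if t == k) for k in labels]  # true counts
    p_k = [sum(1 for p in y_pred if p == k) for k in labels]  # predicted counts
    numerator = c * s - sum(t * p for t, p in zip(t_k, p_k))
    denominator = sqrt(
        (s * s - sum(p * p for p in p_k)) * (s * s - sum(t * t for t in t_k))
    )
    mcc = numerator / denominator if denominator else 0.0
    return macro_f1, mcc


# Toy example with made-up labels (not data from the study).
toy_true = ["anxiety", "anxiety", "depression", "depression"]
toy_pred = ["anxiety", "depression", "depression", "depression"]
toy_macro_f1, toy_mcc = macro_f1_and_mcc(toy_true, toy_pred)
```

Unlike accuracy, both quantities penalize a classifier that ignores a minority class, which is presumably why the abstract prioritizes them on an imbalanced corpus.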

18

MedMisBench: Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

Zhou, H.; Zou, X.; Wu, J.; Wu, S.; Wu, J.; Segal, B. M.; Niebuhr, T. E.; Amro, S.; Petrus, M.; Momin, S.; Cardoso Pinto, A.; Niesen, R.; Wegner, L. S.; Darji, D.; Koo, J. M.; Fieggen, J.; Narain, K.; Zeng, M.; Clifton, L.; Shapiro, L.; Liu, F.; Clifton, D. A.

2026-05-28 bioengineering 10.64898/2026.05.25.727671 medRxiv

Top 0.1%

25.9%

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.

19

Precision stratification of risk for suicidal behavior in people with bipolar depression

de Lacy, N.; Lam, W. Y.; Virtosu, M.; Deshmukh, V.; Wilson, F. A.; Pescosolido, B.; Smith, K. R.

2026-02-25 psychiatry and clinical psychology 10.64898/2026.02.23.26346921 medRxiv

Top 0.1%

25.7%

Patients with bipolar depression are at the highest risk for suicidal behavior, comprising ~10% of all deaths. In the critical period preceding attempts, most are not in contact with mental health professionals to effect antisuicidal strategies. There is an urgent need for decision support tools to help nonspecialist providers identify those at elevated risk to facilitate prevention. However, we lack robust, performant predictive models to form the core of such tools. Here, we build a high-precision predictive model of 30-day risk for suicidal behavior using unique electronic health record data from >220,000 patients with bipolar depression. We show that optimized machine learning approaches offer very strong clinical utility, delivering high Standardized Net Benefit in the context of near-perfect calibration and smooth, threshold-robust decision curves. Our results break the longstanding performance ceiling in suicide risk prediction and highlight the importance of training models for clinical utility as well as discriminative skill.
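
The abstract does not state how Standardized Net Benefit was computed. As a rough guide, the sketch below uses the usual decision-curve definition: net benefit at a risk threshold, divided by outcome prevalence. The function names and toy values are assumptions for illustration, not the authors' code or data.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    NB = TP/n - (FP/n) * threshold / (1 - threshold)
    y_true holds 0/1 outcomes; risk holds predicted probabilities.
    """
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def standardized_net_benefit(y_true, risk, threshold):
    """Net benefit divided by prevalence, so a perfect model scores 1.0."""
    prevalence = sum(y_true) / len(y_true)
    return net_benefit(y_true, risk, threshold) / prevalence


# Toy data (made up): one event among four patients.
toy_outcomes = [1, 0, 0, 0]
toy_risks = [0.9, 0.1, 0.2, 0.3]
toy_snb = standardized_net_benefit(toy_outcomes, toy_risks, threshold=0.5)

# A decision curve evaluates the same quantity over a range of thresholds.
toy_curve = [
    (pt, standardized_net_benefit(toy_outcomes, toy_risks, pt))
    for pt in (0.05, 0.25, 0.5)
]
```

The threshold term weights false positives by the odds at which a clinician would be willing to act, which is what ties the metric to clinical utility and not only to discrimination.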

20

Scalable, non-invasive depression monitoring with smartphone speech: a multimodal benchmark and topic analysis

Emden, D.; Gutfleisch, L.; Herpertz, J.; Leenings, R.; Blitz, R.; Holstein, V. L.; Goltermann, J.; Richter, M.; Chevance, A.; Fleuchaus, A.; Winter, N. R.; Spanagel, J.; Meinert, S.; Borgers, T.; Flinkenflugel, K.; Stein, F.; Alexander, N.; Jamalabadi, H.; Leehr, E. J.; Redlich, R.; Ebner-Priemer, U.; Nenadic, I.; Kircher, T.; Dannlowski, U.; Hahn, T.; Opel, N.

2025-07-18 psychiatry and clinical psychology 10.1101/2025.07.17.25331744 medRxiv

Top 0.1%

25.6%

Objective, scalable biomarkers are needed for continuous monitoring of major depressive disorder (MDD). Smartphone-collected speech is promising, yet extracting clinically useful signals remains difficult. We analysed 3,151 weekly voice diaries from 284 German-speaking adults (128 MDD, 156 controls) and regressed Beck Depression Inventory (BDI) scores. Sentence embeddings from the open-source 8-billion-parameter Qwen3-8B model predicted scores with MAE = 4.45 and R² = 0.35, explaining 16 more percentage points of variance than the best traditional feature set (TF-IDF). Adding lexical-prosodic or TF-IDF features provided only marginal improvement (best MAE = 4.39). To interpret the embeddings, we applied BERTopic and uncovered ten coherent themes; BDI scores peaked for "Persistent Low Mood" and "Pain Distress", confirming clinical relevance. Large-language-model embeddings therefore capture the dominant signal of depression severity in everyday speech and, paired with interpretable topic analysis, offer a privacy-preserving, scalable route to digital mental-health phenotyping.
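
For reference, the two regression metrics reported above (mean absolute error and the coefficient of determination) can be computed as in the following minimal sketch. The function names and toy scores are assumptions for illustration; they are not the study's pipeline or data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy questionnaire scores (made up, not from the preprint).
toy_observed = [10, 20, 30]
toy_predicted = [12, 18, 33]
toy_mae = mean_absolute_error(toy_observed, toy_predicted)
toy_r2 = r_squared(toy_observed, toy_predicted)
```

MAE is expressed in the units of the questionnaire score, while R² is the share of score variance the predictions account for, so the two numbers answer different questions about the same model.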